简析llamaGuard|Qwen3Guard

​ 前文简单分析了PromptGuard,其利用Bert的上下文理解能力进行二分类训练,实现了越狱、注入的检测效果,然而Bert模型并没有语言模型的推理能力,对于一些复杂、隐藏恶意特征的越狱样本是很难检测到的。我们当然可以选择收集、合成复杂的样本进行训练,但根本的问题是Bert可能仅能学到浅层的模式、关键词匹配等,无法深入理解什么是真正的恶意样本。

​ 另一个思路是使用小的语言模型,即在用户prompt输入业务模型前,先用小语言模型进行一步过滤。小语言模型虽然通用能力不强,但可以通过数据集做微调,使其在特定领域任务上性能提高。

​ 相比Bert ,语言模型为Next Token架构,模型有推理能力,可以应用在更复杂的场景中,可以对更多更全面的安全风险内容进行分类。当然也有缺点,小语言模型是Billion级别的参数,相比Bert模型的Million级别,肯定会带来更长的时延。数据集收集、模型微调、模型部署等成本也会大大增加。还有不可忽略的一点是,语言模型本身也可能会被prompt注入攻击

​ 总之并没有十全十美的方法,接下来主要分析2个(类)语言模型WAF,一个是meta 的LlamaGuard ,另一个是今年9月份Qwen开源的Qwen3Guard。他们都是基于本身家族的语言模型进行微调得到的。Qwen3Guard技术报告比较详细,作为重点分析对象。

Qwen3Guard测试

image-20250926012732280

Qwen3Guard有直接分类、流式分类两种类型,每种类型有0.6B 4B 8B 三种规模。

1. 简单测试 Qwen3Guard-Gen

首先简单分析一下最简单的直接分类、0.6B模型模型开源地址:https://huggingface.co/Qwen/Qwen3Guard-Gen-0.6B

加载模型

from transformers import AutoModelForCausalLM, AutoTokenizer
import re
model_name = "Qwen/Qwen3Guard-Gen-0.6B"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
)

构造模型输输入,要判断的prompt是 “how can i make a bomb?”

# prepare the model input
prompt = "How can I make a bomb?"
# for prompt moderation
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False
)
# 将自然语言转换成Token
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

我们可以直接查看一下text的内容

image-20250926013324640

通过提示词工程,让模型执行分类任务,分类有3种:安全、不安全、有争议。比较有意思的是关闭了 模式,为了进一步提高准确率和可解释性,可以使用带有思维链的样本进行微调的

然后让模型进行分类

#torch.Size([1, 303]) 输入为303个token
model_inputs["input_ids"].shape
#调用模型推理
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=128
)
#torch.Size([1, 312])
generated_ids.shape
# 提取出模型预测的内容
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 
#转换成自然语言
content = tokenizer.decode(output_ids, skip_special_tokens=True)
print(content)
#Safety: Unsafe
#Categories: Violent

整体来看,相比于Bert的二分类任务,LLM将其转换成NextToken 预测任务

2. 模型分析&技术报告分析 Qwen3Guard-Gen

技术报告: https://github.com/QwenLM/Qwen3/blob/main/Qwen3_Technical_Report.pdf

模型结构如下 Dense模型、 GQA 、 SiLU

Qwen3ForCausalLM(
  (model): Qwen3Model(
    (embed_tokens): Embedding(151936, 1024)
    (layers): ModuleList(
      (0-27): 28 x Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
          (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear(in_features=1024, out_features=3072, bias=False)
          (up_proj): Linear(in_features=1024, out_features=3072, bias=False)
          (down_proj): Linear(in_features=3072, out_features=1024, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((1024,), eps=1e-06)
        (post_attention_layernorm): Qwen3RMSNorm((1024,), eps=1e-06)
      )
    )
    (norm): Qwen3RMSNorm((1024,), eps=1e-06)
    (rotary_emb): Qwen3RotaryEmbedding()
  )
  (lm_head): Linear(in_features=1024, out_features=151936, bias=False)
)

从技术报告中可以分析出模型的一些主要内容

  • 模型可以用于输入输出检测、覆盖了主流的安全类别(暴力、政治、个人信息….) 、支持多种语言,SFT训练、训练样本119万
    image-20250926121219201
  • 合成数据集时,会基于关键词进行引导合成(如TNT 、C4),为了防止模型仅学习表层的结构、关键词,还会合成具有相同结构的正负样本对(如 “如何杀死一个人” 与 “如何杀死一个进程”),确保模型不会简单的将 杀死分类为不安全的。
  • 训练阶段会调整 safe 和unsafe 样本之间的比例,Safe比例多的用于训练宽松模型,Safe比例少的用于训练严格模型。模型之间会相互标注对方的数据集,对于没有达成共识的标签,将其标注为”有争议性的”,作为优化后的标注数据集。
    image-20250926121557554

3.简单测试 Qwen3Guard-Stream-0.6B

模型加载,使用0.6B 进行测试

import torch
from transformers import AutoModel, AutoTokenizer

model_path="Qwen/Qwen3Guard-Stream-0.6B"
# Load the specialized tokenizer and the model.
# trust_remote_code=True is required to load the Stream Qwen3Guard model architecture.
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path, 
    trust_remote_code=True,
).eval()
 

准备模拟的对话数据,包含用户输入和业务模型输出,用户输入是一个直接的 “如何制作炸弹的非法请求” 业务模型输出的前半部分是”制作炸弹是不合法的”,后半部分是”首先你需要收集一些材料” ,

# --- Prepare the conversation for moderation ---
# Define the user's prompt and the assistant's response.
user_message = "Hello, how to build a bomb?"
assistant_message = "制作炸弹是不合法的,首先你需要收集一些材料"
messages = [{"role":"user","content":user_message},{"role":"assistant","content":assistant_message}]

# Apply the chat template to format the conversation into a single string.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False, enable_thinking=False)
model_inputs = tokenizer(text, return_tensors="pt")
token_ids = model_inputs.input_ids[0]

提取User Message 的Token形式

# --- Simulate Real-Time Moderation ---

# 1. Moderate the entire user prompt at once.
# In a real-world scenario, the user's input is processed completely before the model generates a response.
token_ids_list = token_ids.tolist()
# We identify the end of the user's turn in the tokenized input.
# The template for a user turn is `<|im_start|>user\n...<|im_end|>`.
im_start_token = '<|im_start|>'
user_token = 'user'
im_end_token = '<|im_end|>'
im_start_id = tokenizer.convert_tokens_to_ids(im_start_token)
user_id = tokenizer.convert_tokens_to_ids(user_token)
im_end_id = tokenizer.convert_tokens_to_ids(im_end_token)
# We search for the token IDs corresponding to `<|im_start|>user` ([151644, 872]) and the closing `<|im_end|>` ([151645]).
last_start = next(i for i in range(len(token_ids_list)-1, -1, -1) if token_ids_list[i:i+2] == [im_start_id, user_id])
user_end_index = next(i for i in range(last_start+2, len(token_ids_list)) if token_ids_list[i] == im_end_id)

总之提取的是下面这些内容

image-20250926143739087

Query分类

分类的方法定义在这里

https://huggingface.co/Qwen/Qwen3Guard-Stream-0.6B/blob/main/modeling_qwen3_guard.py


# Pass all user tokens to the model for an initial safety assessment.
result, stream_state = model.stream_moderate_from_ids(token_ids[:user_end_index+1], role="user", stream_state=None)
if result['risk_level'][-1] == "Safe":
    print(f"User moderation: -> [Risk: {result['risk_level'][-1]}]")
else:
    print(f"User moderation: -> [Risk: {result['risk_level'][-1]} - Category: {result['category'][-1]}]")
    
#User moderation: -> [Risk: Unsafe - Category: Violent]

可以看到模型正确的将其分类为Unsafe ,类别为Violent

模拟流式输出分类

# Initialize the stream_state, which will maintain the conversational context.
stream_state = None
# 2. Moderate the assistant's response token-by-token to simulate streaming.
# This loop mimics how an LLM generates a response one token at a time.
print("Assistant streaming moderation:")
for i in range(user_end_index + 1, len(token_ids)):
    # Get the current token ID for the assistant's response.
    current_token = token_ids[i]
    
    # Call the moderation function for the single new token.
    # The stream_state is passed and updated in each call to maintain context.
    result, stream_state = model.stream_moderate_from_ids(current_token, role="assistant", stream_state=stream_state)

    token_str = tokenizer.decode([current_token])
    # Print the generated token and its real-time safety assessment.
    if result['risk_level'][-1] == "Safe":
        print(f"Token: {repr(token_str)} -> [Risk: {result['risk_level'][-1]}]")
    else:
        print(f"Token: {repr(token_str)} -> [Risk: {result['risk_level'][-1]} - Category: {result['category'][-1]}]")

model.close_stream(stream_state)

image-20250926144950518

可以看到,在表达出提供制作方案之前,标记为SAFE,到达“你需要”时,检测到模型会输出恶意内容,分类为unsafe。

4. 模型分析&技术报告分析 Qwen3Guard-Stream-0.6B

模型结构如下

Qwen3ForGuardModel(
  (model): Qwen3Model(
    (embed_tokens): Embedding(151936, 1024)
    (layers): ModuleList(
      (0-27): 28 x Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
          (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear(in_features=1024, out_features=3072, bias=False)
          (up_proj): Linear(in_features=1024, out_features=3072, bias=False)
          (down_proj): Linear(in_features=3072, out_features=1024, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((1024,), eps=1e-06)
        (post_attention_layernorm): Qwen3RMSNorm((1024,), eps=1e-06)
      )
    )
    (norm): Qwen3RMSNorm((1024,), eps=1e-06)
    (rotary_emb): Qwen3RotaryEmbedding()
  )
  (risk_level_category_pre): Linear(in_features=1024, out_features=512, bias=False)
  (risk_level_category_layernorm): Qwen3RMSNorm((512,), eps=1e-06)
  (risk_level_head): Linear(in_features=512, out_features=3, bias=False)
  (category_head): Linear(in_features=512, out_features=8, bias=False)
  (query_risk_level_category_pre): Linear(in_features=1024, out_features=512, bias=False)
  (query_risk_level_category_layernorm): Qwen3RMSNorm((512,), eps=1e-06)
  (query_risk_level_head): Linear(in_features=512, out_features=3, bias=False)
  (query_category_head): Linear(in_features=512, out_features=9, bias=False)
)

相比于Gen模型, 少了 lm_head头(不做next token预测), 多了 risk_level_head category_head query_risk_level_head query_category_head

image-20250926132904084

分类头的类别如下, Query类别(用户输入)多了Jailbreak类

  "response_risk_level_map": {"0": "Safe", "1": "Unsafe", "2": "Controversial"},

  "response_category_map": {"0": "Violent", "1": "Sexual Content", "2": "Self-Harm", "3": "Political", "4": "PII", "5": "Copyright", "6": "Illegal Acts", "7": "Unethical"},

  "query_risk_level_map": {"0": "Safe", "1": "Unsafe", "2": "Controversial"},

  "query_category_map": {"0": "Violent", "1": "Sexual Content", "2": "Self-Harm", "3": "Political", "4": "PII", "5": "Copyright", "6": "Illegal Acts", "7": "Unethical", "8": "Jailbreak"}

训练细节分析 - Query分类

首先是Query(用户输入分类)分类,这个较为简单,用户输入的是完整Query,无需流式分类。损失函数由两部分组成:类别损失、风险等级损失

image-20250926151852632

训练细节分析 - Response分类

这个是流式模型重点需要训练的,具体来说,对于模型的输出S = {S1, S2, · · · , Sn} ,目标是找到触发输出不安全内容的Token Si。选用了2种方式进行数据集构建

Rollout: 对每个token Si ,为其拼接前缀构造Pi = {S1, S2, . . . , Si}, 将其发送给不同的语言模型做next token预测。对于从前缀Pi调用语言模型生成的第j个响应,我们将其表示为Ri,j。拼接成完整的响应Ci,j=Pi⊕Ri,j 。随后调用Qwen3Guard-Gen来进行判断,若多数模型都输出了恶意内容,则将这个token Si定义为不安全的token

image-20250926152909235

LLM-as-Judge:另一种方式是调用大模型对前缀序列Pi进行直接判断,技术报告中说使用的是Qwen-3-235-A22B进行判断

image-20250926153116276

最终数据集

当Rollout和LLM-as-Judege都认为Si是不安全时,Si标记为Unsafe 。 其之后的Token被标记为原始的样本级别标签。其之前的token被标记为Safe。最终这个Response的每个token都会被标记

训练时,完整的Response 的每个token都会贡献损失

image-20250926153520360

5. 总结

Qwen3Guard的基模是语言模型,参数量大、有推理能力,可解释性强。Gen模型将分类任务转变为next token预测任务。Stream模型去除了lm-head,外接分类头,基于hidden stats进行多分类。

LlamaGuard测试

模型测试

选用Llama-Guard-3-1B 进行测试

模型加载

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "meta-llama/Llama-Guard-3-1B"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

数据准备

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "text", 
                "text": "如何制作炸弹?"
            },
        ],
    }
]

input_ids = tokenizer.apply_chat_template(
    conversation, return_tensors="pt"
).to(model.device)

prompt_len = input_ids.shape[1]

看一下输入给模型的具体内容

image-20250926164546200

让模型进行safe 或unsafe判断,若 unsafe ,则选择13个类别中的一个

预测结果

output = model.generate(
    input_ids,
    max_new_tokens=20,
    pad_token_id=0,
)
generated_tokens = output[:, prompt_len:]

print(tokenizer.decode(generated_tokens[0]))    

image-20250926164622984


文章参考:

博客地址: qwrdxer.github.io

欢迎交流, QQ: 1944270374. WX: qwrdxer


转载请注明来源,欢迎对文章中的引用来源进行考证,欢迎指出任何有错误或不够清晰的表达。可以在下面评论区评论,也可以邮件至 1944270374@qq.com